Assignment: Trees and Forests

In this assignment, I would like you to develop a random forest to classify the MNIST dataset. Here are the basic requirements:

  1. This is a multi-class problem: the random forest should classify all 10 digits.
  2. As we did with the pulsar data, I want you to use k-fold validation, as well as the error curve, to determine the optimal depth for the trees used in the forest.
  3. I want you to look at feature performance. In the case of MNIST, the features are the pixels. I would like you to determine the ranked importance of all of the pixels, and then make a heatmap (using plotly) along the lines of the plot in the book (Chapter 7, figure 7-6).

This assignment is essentially a combination of parts of the "module3_2_multiclassV2.ipynb" workbook, as well as the "module4_2_random_forest.ipynb" workbook.

NOTE: This whole assignment can be done using the "short" dataset. You can (if you want) use the full dataset, but it is not necessary. It will take much longer to run if you do!

Use the structure below to craft your solution.

Task 1: Get the data

Task 2: Get the appropriate methods from earlier notebooks

The Performance Method

Get this from the module3_2_multiclassV2 workbook. Remember that the multiclass performance method does not return AUC as a performance metric. I suggest using "macro accuracy".
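If you do not have the notebook handy, the idea behind a multiclass performance method can be sketched as below. This is a minimal stand-in, not the notebook's actual code: the function name `performance` and its return values are assumptions, and "macro accuracy" is taken here to mean the mean of the per-class accuracies (per-class recall).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def performance(y_true, y_pred, n_classes=10):
    """Return per-class accuracy and macro accuracy.

    Hypothetical sketch -- use the version from module3_2_multiclassV2.
    """
    cm = confusion_matrix(y_true, y_pred, labels=range(n_classes))
    # Diagonal = correct predictions per class; row sum = true count per class.
    per_class = cm.diagonal() / cm.sum(axis=1)
    # Macro accuracy: unweighted mean over the classes.
    return per_class, per_class.mean()
```

Note that, unlike the binary case, there is no AUC here; macro accuracy treats every digit equally regardless of how many examples it has.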

The Runfitter Method

Get this from the module3_2_multiclassV2 notebook.

Task 3: Prepare the data

Get this also from module3_2_multiclassV2.

As in that case, we will split the data into two datasets: a training set and a test set.

Remember that the features are the first 784 columns, and the labels are given by the "digit" column.
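The feature/label split can be sketched as follows. The DataFrame here is a small random stand-in for the MNIST file (which the notebook loads for you); only the column layout — 784 pixel columns followed by a `"digit"` column — matches the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame: 784 pixel columns followed by a "digit" label column.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 256, size=(100, 784)))
df["digit"] = rng.integers(0, 10, size=100)

X = df.iloc[:, :784].to_numpy()  # features: the first 784 columns (pixels)
y = df["digit"].to_numpy()       # labels: the "digit" column

# Two datasets: train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```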

Task 4: Run the fitter once

Use a reasonable set of parameters for the forest estimator: for example, n_estimators=100, max_depth=5. Then perform one run of k-fold (with k=5) validation to see what sort of accuracy you get.

Print the average accuracy over the 5 folds for each digit.

An overall average macro accuracy of 86% on the test set is reasonable with these parameters.
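One way this single k-fold run could look is sketched below. It uses scikit-learn's small built-in 8x8 digits set as a stand-in for MNIST (so the accuracy you see will differ from the 86% quoted above), and computes per-digit accuracy directly rather than through the notebook's runfitter method.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Stand-in data: sklearn's 8x8 digits instead of 28x28 MNIST.
X, y = load_digits(return_X_y=True)

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, test_idx in kf.split(X, y):
    clf = RandomForestClassifier(n_estimators=100, max_depth=5,
                                 random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    # Per-digit accuracy (recall) on this fold's test split.
    per_digit = [np.mean(pred[y[test_idx] == d] == d) for d in range(10)]
    fold_scores.append(per_digit)

avg_per_digit = np.mean(fold_scores, axis=0)  # average over the 5 folds
macro_accuracy = avg_per_digit.mean()
for d, acc in enumerate(avg_per_digit):
    print(f"digit {d}: {acc:.3f}")
print(f"macro accuracy: {macro_accuracy:.3f}")
```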

Task 5: Finding the Optimal max_depth

Here you want to loop over max_depth from 2 to 22. To save time, you could step in increments of 2 (2, 4, 6, ..., 20, 22).

Use n_estimators=50 to save even more time.

Look at the "Overfitting vs Underfitting" section of module4_2_random_forest.ipynb. However, in your dfError object, you will only need to save "1-avg_accuracyMacro" for test and train.
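A minimal sketch of the depth sweep is below. To keep it self-contained it again uses the 8x8 digits stand-in, a single train/test split rather than the full k-fold loop, and plain accuracy via `clf.score` in place of the macro accuracy your performance method returns; swap those pieces for the notebook's versions in your solution.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)  # stand-in for the MNIST short set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rows = []
for depth in range(2, 23, 2):  # 2, 4, ..., 22
    clf = RandomForestClassifier(n_estimators=50, max_depth=depth,
                                 random_state=42)
    clf.fit(X_train, y_train)
    rows.append({
        "max_depth": depth,
        # Save error = 1 - accuracy for both splits.
        "train_error": 1 - clf.score(X_train, y_train),
        "test_error": 1 - clf.score(X_test, y_test),
    })
dfError = pd.DataFrame(rows)
```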

Task 6: Plot accuracy vs max_depth

This plot should have both train and test results on it.

Task 7: Feature importance

As noted above, we want to find the feature importance of the pixels. However, we want to do this with the final model. What is the final model? It is the model with the ideal max_depth that we determined from Task 6. Choose the max_depth where the test error is small and the model is simplest (where the test accuracy begins to plateau).

So you will want to re-train the model, using this max_depth, as well as the full data. Look at how this is done at the end of "module4_2_random_forest.ipynb".

Also, to plot the feature importance of the pixels, you will have to reshape the returned importances (which are of length 784) into an array of shape (28, 28). Then you can plot them as a heatmap; with Plotly Express this is best done using px.imshow.

Also, are you sure that the orientation of the heatmap is correct? Plot a single digit (like X_train[0].reshape(28,28)) to check this.

Extra Credit: Explaining the Feature Importance

The shape of the feature importance, as seen from the imshow heatmap, might make sense. But is there a way to confirm this? We might expect that a pixel's importance is related to how often that pixel is occupied (non-zero) across all of the digits. Let's see if this is the case.

Find a way to calculate the average occupancy of every pixel, across all of the images. Then: